1 Sample 111L

## [1] "Number of reads: 204399"

2 VJ alignment

## [1] "Execution time: 2.066465 hours"

Alignment summary

HVR can be delimited if V and J segments and their conserved Cys and Phe codons are identified in the read. This is the case for most of the reads (98.1%). Also, it is more common for Cys and Phe codons to be in frame (Frame 0).

Only 1.9% of the reads lack of at least one of the four mentioned elements. These reads are discarded from the analysis.

Detailed results of the vj_alignment function are shown in the following tables:

Status # Reads
HVR found 200445
HVR not found 3954
Status # Reads
Frame 0 171402
Frame 1 11106
Frame 2 17937
No Cys 1297
No Cys no Phe 35
No J 918
No Phe 1643
No V 61

Unique reads

Reads with delimited HVR (200445 reads) can be grouped into unique clonotypes (same V and J match and HVR sequence). This results in 43699 unique reads:

VJ families

There are 990 VJ families in the sample. A heatmap with the number of unique HVR sequences per VJ family is shown here:

One-membered families

140 VJ families with only one HVR sequence are excluded from the clustering step. These families are shown here:

Therefore, 850 VJ families undergo the clustering.

3 HVR clustering

Affinity propagation (AP) is an algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data point as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges. More information about APC and the 5 approaches tested here could be found in HVR_clustering_sample_111H notebook.

Total number of exemplars is shown at the bottom of each table (number of entries).

q = 0

Minimum p value (tends to underestimate the number of clusters).

q = 0.25

q = 0.5

Median p value (tends to overestimate the number of clusters).

q = 0.75

Between median and maximum p value (tends to overestimate the number of clusters).

q = 1

Maximum p value (tends to overestimate the number of clusters).

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

Adjust p

Specify p for each HVR based on abundance (n) modified to be in the same range as the similarity matrix s. Proposed formula:

\[ p_{i} = min(s) + \left(\frac{(max(s) - min(s)) · (n_{i} - min(n))}{max(n) -min(n)}\right) \]

Adjust s

Modify s according to Zhang et al paper (section 1.2.1).

\[ s'_{(i,j)}=\begin{cases} -n_{i}\: d_{(i,j)}^{2} & \text{ if } i\neq j \\ s^{*} & \text{ otherwise } \end{cases} \]

where \(d\) is the distance between sequences and \(s^{*}\) is a small constant called preference (in our case the principal diagonal is 0).

4 Comparison with MiXCR results

MiXCR unique clonotypes: 7020 (multiple reads) + 1084 (one read) = 8104

The APC model that yields the closest number of exemplars would be q = 0.5 (N = 7023). Our best model according to the supervised measurements was "Adjust p" (N = 3387).

  • MiXCR valid reads: 147445
  • Our valid reads: 200445

For the clustering step we use 53000 reads that MiXCR discarded.

5 Dendrograms

Dendrograms of 850 VJ families with more than one distinct HVR sequence were plotted for manual annotation (all files available in /media/bacon/Carazo_TCRSeq_IonTorrentS5/03_sequenceAnalysis/april_2020/plots/dendrograms_sample_111L_weighted_hclust_wardD). The method used was a weighted hierarchical clustering with ward.D linkage, as it showed to be the best linkage method for sample 111H.

~80 dendrograms were randomly selected for manual annotation. The Google Docs presentation with the annotated dendrograms (work in progress) is here.

6 Model evaluation

For clustering models evaluation and comparison, "true" exemplars were manually selected for 75 (out of 87, work in progress) randomly selected VJ families in sample 111L. Selected exemplars on dendrograms can be seen here. The following table contains a summary of the manual exemplar selection (with comments in special cases):

With this "ground truth", we can evaluate the clustering models in terms of supervised classification metrics such as:

  • Accuracy: ratio of correctly predicted observation to the total observations. \[accuracy = \frac{TP + TN}{TP + TN + FP + FN}\]

  • Recall (sensitivity or true positive rate): ratio of correctly predicted positive observations to the all observations in actual class. \[recall = \frac{TP}{TP + FN}\]

  • Precision: ratio of correctly predicted positive observations to the total predicted positive observations. \[precision = \frac{TP}{TP+FP}\]
  • F1 score: weighted average of precision and recall (this score takes both false positives and false negatives into account).

\[F_{1} = 2 * \frac{recall · precision}{recall + precision}\]

7 Tunning q for each VJ family

The results with 111L are not as good as with sample 111H, which makes us think that the differences in diversity directly affect the clustering process. Since q parameter (indirectly) determines the number of exemplars, q should be higher in larger/more diverse families to choose more exemplars. The problem is to estimate the family diversity in an uncorrected set of reads. Three estimators are proposed:

  • Number of unique HVR
  • Shannon's index (diversity)
  • Functional diversity: this approach was published here.

Preliminary analysis

The following figures are interactive and show the relationship between classification metrics (accuracy, recall, precision, F1) and the three proposed estimators. Performance metrics can be hidden or shown by clicking them on the figure legend.

Unique HVR

## `geom_smooth()` using formula 'y ~ x'

Shannon's Index

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 40 rows containing non-finite values (stat_smooth).

Functional Diversity

Chiu and Chao paper (2014)

A diversity order, \(q [0, 3]\), should be specified. \(q = 2\) was chosen arbitrarily.

## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 40 rows containing non-finite values (stat_smooth).

¿Which functional diversity q value gives the best correlation between functional diversity and true number of HVRs?

## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 4 row(s) containing missing values (geom_path).

hillR package


¿Is there a relationship between any of the estimators and the optimal q parameter?

Let's check which q value gives the best model for each family and see if there is a correlation with unique HVR / Shannon's Index / Functional diversity. The chosen metric to evaluate the models is F1, since our dataset is very imbalanced (low number of exemplars or true positives and large number of non-exemplars or true negatives).

It seems that functional diversity presents the best correlation with the optimal q parameter.

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

Unique HVR

Y axis was log10-scaled for better visualization.

Shannon's Index

Functional diversity

Y axis was log10-scaled for better visualization.


APC with q based on functional diversity

In a similar way that we adjusted p for each HVR based on its frequency (bringing the frequency to the similarity matrix values range), we could adjust q for a VJ family bringing its functional diversity value to the range 0-1. Since this transformation is lineal and a lineal correlation with q was observed with log10-transformed functional diversity values, here the values should be log10-transformed as well.

Weighted (q) is the second best model in terms of accuracy and F1 (q=0.75 is the best one).

Levenshtein distances distribution per family

Histograms of Levenshtein distances distribution for each VJ family are available in /media/bacon/Carazo_TCRSeq_IonTorrentS5/03_sequenceAnalysis/april_2020/plots/levenshtein_distance_histograms_111L/.

Adaptive Affinity Propagation Clustering

From Fenias-Moiane et al 2.2 Section:

Adaptive Affinity Propagation (AAP) algorithm is an AP variant developed to deal with issues related with setting the initial preference value p which is determined by taking the minimum or the median of the similarity S(i,j), and occurrence of oscillations in the original AP algorithm. The choice of preference influences on the final resulting number of clusters, and the oscillation may often cause the process to fail to reach the convergence.

  • AAP pseudocode is available in the previous source.
  • MATLAB implementation is available here.
  • Scikit-learn forum entry about AAP.

Session info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] Rcpp_1.0.5            tidyr_1.1.2           vegan_2.5-6          
##  [4] lattice_0.20-38       permute_0.9-5         plotly_4.9.0         
##  [7] reshape2_1.4.4        DT_0.16               apcluster_1.4.8      
## [10] WeightedCluster_1.4-1 cluster_2.1.0         TraMineR_2.2-0.1     
## [13] bsselectR_0.1.0       stringr_1.4.0         ggtree_2.0.4         
## [16] plyr_1.8.6            ggpubr_0.4.0          ggplot2_3.3.2        
## [19] webr_0.1.6            dplyr_1.0.2           msa_1.16.0           
## [22] Biostrings_2.52.0     XVector_0.24.0        IRanges_2.18.3       
## [25] S4Vectors_0.22.1      BiocGenerics_0.30.0  
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1                uuid_0.1-4                 
##   [3] backports_1.2.0             Hmisc_4.2-0                
##   [5] systemfonts_0.3.2           lazyeval_0.2.2             
##   [7] splines_3.6.1               crosstalk_1.1.0.1          
##   [9] BiocParallel_1.18.1         GenomeInfoDb_1.20.0        
##  [11] digest_0.6.27               htmltools_0.5.1.1          
##  [13] magrittr_1.5                checkmate_2.0.0            
##  [15] openxlsx_4.2.3              readr_1.4.0                
##  [17] matrixStats_0.57.0          officer_0.3.15             
##  [19] jpeg_0.1-8.1                colorspace_1.4-1           
##  [21] haven_2.3.1                 xfun_0.19                  
##  [23] RCurl_1.98-1.2              crayon_1.3.4               
##  [25] jsonlite_1.7.1              rrtable_0.2.1              
##  [27] survival_3.2-7              zoo_1.8-8                  
##  [29] ape_5.4-1                   glue_1.4.2                 
##  [31] polyclip_1.10-0             rvg_0.2.5                  
##  [33] gtable_0.3.0                zlibbioc_1.30.0            
##  [35] DelayedArray_0.10.0         sjmisc_2.8.5               
##  [37] car_3.0-10                  DEoptimR_1.0-8             
##  [39] abind_1.4-5                 scales_1.1.1               
##  [41] rstatix_0.7.0               miniUI_0.1.1.1             
##  [43] viridisLite_0.3.0           xtable_1.8-4               
##  [45] htmlTable_2.1.0             tmvnsim_1.0-2              
##  [47] tidytree_0.3.3              foreign_0.8-72             
##  [49] Formula_1.2-4               vcd_1.4-8                  
##  [51] htmlwidgets_1.5.2           httr_1.4.2                 
##  [53] RColorBrewer_1.1-2          acepack_1.4.1              
##  [55] ellipsis_0.3.1              pkgconfig_2.0.3            
##  [57] farver_2.0.3                nnet_7.3-12                
##  [59] sass_0.3.1                  labeling_0.4.2             
##  [61] tidyselect_1.1.0            rlang_0.4.11               
##  [63] later_1.1.0.1               munsell_0.5.0              
##  [65] cellranger_1.1.0            tools_3.6.1                
##  [67] generics_0.1.0              devEMF_4.0-2               
##  [69] sjlabelled_1.1.7            moonBook_0.2.3             
##  [71] broom_0.7.6                 evaluate_0.14              
##  [73] fastmap_1.0.1               yaml_2.2.1                 
##  [75] knitr_1.30                  robustbase_0.93-6          
##  [77] zip_2.1.1                   purrr_0.3.4                
##  [79] nlme_3.1-140                mime_0.9                   
##  [81] xml2_1.3.2                  compiler_3.6.1             
##  [83] rstudioapi_0.11             curl_4.3                   
##  [85] png_0.1-7                   ggsignif_0.5.0             
##  [87] treeio_1.8.2                tibble_3.0.4               
##  [89] tweenr_1.0.1                bslib_0.2.4                
##  [91] stringi_1.5.3               highr_0.8                  
##  [93] forcats_0.5.0               gdtools_0.2.2              
##  [95] Matrix_1.2-17               psych_2.0.9                
##  [97] vctrs_0.3.8                 pillar_1.4.6               
##  [99] lifecycle_0.2.0             BiocManager_1.30.10        
## [101] editData_0.1.2              lmtest_0.9-38              
## [103] jquerylib_0.1.3             bitops_1.0-6               
## [105] data.table_1.13.2           insight_0.10.0             
## [107] flextable_0.5.11            ztable_0.2.2               
## [109] GenomicRanges_1.36.1        httpuv_1.5.4               
## [111] hwriter_1.3.2               R6_2.5.0                   
## [113] latticeExtra_0.6-29         ShortRead_1.42.0           
## [115] promises_1.1.1              gridExtra_2.3              
## [117] rio_0.5.16                  boot_1.3-23                
## [119] MASS_7.3-51.1               SummarizedExperiment_1.14.1
## [121] shinyWidgets_0.5.4          withr_2.3.0                
## [123] Rsamtools_2.0.3             GenomicAlignments_1.20.1   
## [125] mnormt_2.0.2                GenomeInfoDbData_1.2.1     
## [127] mgcv_1.8-28                 hms_0.5.3                  
## [129] grid_3.6.1                  rpart_4.1-15               
## [131] rmarkdown_2.7               rvcheck_0.1.8              
## [133] carData_3.0-4               ggforce_0.3.1              
## [135] Biobase_2.44.0              shiny_1.5.0                
## [137] base64enc_0.1-3